Automatically Discovering the Number of Clusters in Web Page Datasets

نویسندگان

Zhongmei Yao

Ben Choi

چکیده

Clustering is well suited for Web mining by automatically organizing Web pages into categories each of which contains Web pages having similar contents. However, one problem in clustering is the lack of general methods to automatically determine the number of categories or clusters. For the Web domain in particular, currently there is no such method suitable for Web page clustering. In an attempt to address this problem, we discover a constant factor that characterizes the Web domain, based on which we propose a new method for automatically determining the number of clusters in Web page datasets. We discover that the measure of average inter-cluster similarity reaches a constant of 1.7 when all our experiments produced the best results for clustering Web pages. We determines the number of clusters by using the constant as the stopping factor in our clustering process by arranging individual Web pages into clusters and then arranging the clusters into larger clusters and so on until the average inter-cluster similarity approaches the constant. Having the new method described in this paper together with our new Bidirectional Hierarchical Clustering algorithm reported elsewhere, we have developed a clustering system suitable for mining the Web.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Clustering Web Pages into Hierarchical Categories

متن کامل

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...

متن کامل

Explaining Clusters with Inductive Logic Programming and Linked Data

Knowledge Discovery consists in discovering hidden regularities in large amounts of data using data mining techniques. The obtained patterns require an interpretation that is usually achieved using some background knowledge given by experts from several domains. On the other hand, the rise of Linked Data has increased the number of connected cross-disciplinary knowledge, in the form of RDF data...

متن کامل

Discovering Objects in Dynamically-Generated Web Pages

As the web grows, more and more content is being hidden from the reach of traditional search engines. In this paper, we present THOR, a scalable and efficient tool to mine objects from this hidden web. With precision and recall over 90%, THOR automatically extracts objects of interest from dynamically-generated web pages. Then customized objectidentification algorithms are applied to locate the...

متن کامل

Optimizing Membership Functions using Learning Automata for Fuzzy Association Rule Mining

The Transactions in web data often consist of quantitative data, suggesting that fuzzy set theory can be used to represent such data. The time spent by users on each web page is one type of web data, was regarded as a trapezoidal membership function (TMF) and can be used to evaluate user browsing behavior. The quality of mining fuzzy association rules depends on membership functions and since t...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2005

Automatically Discovering the Number of Clusters in Web Page Datasets

نویسندگان

چکیده

منابع مشابه

Clustering Web Pages into Hierarchical Categories

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

Explaining Clusters with Inductive Logic Programming and Linked Data

Discovering Objects in Dynamically-Generated Web Pages

Optimizing Membership Functions using Learning Automata for Fuzzy Association Rule Mining

عنوان ژورنال:

اشتراک گذاری